From R. A. Fisher, 1938: To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of

Attribution

This work can be reproduced under licence CC BY-NC, assuming you give attribution to the original creators. See authors of this document and the ‘resources’ document in the Github repo for more details.

Loading the packages used in this demo

library(tidyverse)
library(gganimate)
library(dslabs)
library(lubridate)
library(ggplot2)
library(gapminder)
library(gganimate)
# library(gifski)
library(tidyverse)

Measles example

Every data cycle starts with a question / problem:

  • Problem
  • Plan
  • Data
  • Analysis
  • Conclusion
DiagrammeR::grViz("
digraph rmarkdown {
  Problem -> Plan
  Plan -> Data
  Data -> Analysis
  Analysis -> Conclusion 
  Conclusion -> Problem
}
")

Was the introduction of the vaccination against the measles effective in the USA?

library(dslabs)
data(us_contagious_diseases)
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
  dplyr::filter(!state%in%c("Hawaii","Alaska") &  
      disease == the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) 

jet.colors <- colorRampPalette(c("#F0FFFF", "cyan", "#007FFF", "yellow", "#FFBF00", "orange", "red", "#7F0000"), bias = 2.25)

dat %>% mutate(state = reorder(state, desc(state))) %>%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "white", size = 0.35) +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = jet.colors(16), na.value = 'white') +
  geom_vline(xintercept = 1963, col = "black") +
  theme_minimal() + 
  theme(panel.grid = element_blank()) +
  coord_cartesian(clip = 'off') +
  ggtitle(the_disease) +
  ylab("") +
  xlab("") +  
  theme(legend.position = "bottom", text = element_text(size = 8)) + 
  annotate(geom = "text", x = 1963, y = 50.5, label = "Vaccine introduced", size = 3, hjust = 0)

Build upon existing code

Try changing the line in the code chunk above from

the_disease <- "Measles" (Mazelen in Dutch) to

the_disease <- "Smallpox" (Pokken in Dutch) and rerun the code.

Can you find out which part of the code above plots the vertical line and prints the words “Vaccine introduced”. Try to adapt the figure to the Smallpox data: The introduction of the smallpox vaccine was already as early as in 1940.

TIP

You will need to change the year 1963 twice to 1940 in order for the new figure to be correct for the Smallpox subset of the data. Try adapting the code and rerun the code chunk.

Question: When did vaccination against smallpox stop? Try googling for the answer!

Exploring the Gapminder dataset in R

The gapminder dataset contains a number of measurements on health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016.

Inspecting the dataset with R

data("gapminder", package = "dslabs")
gapminder <- gapminder %>% 
  as_tibble()
gapminder
## # A tibble: 10,545 x 9
##    country  year infant_mortality life_expectancy fertility population
##    <fct>   <int>            <dbl>           <dbl>     <dbl>      <dbl>
##  1 Albania  1960            115.             62.9      6.19    1636054
##  2 Algeria  1960            148.             47.5      7.65   11124892
##  3 Angola   1960            208              36.0      7.32    5270844
##  4 Antigu…  1960             NA              63.0      4.43      54681
##  5 Argent…  1960             59.9            65.4      3.11   20619075
##  6 Armenia  1960             NA              66.9      4.55    1867396
##  7 Aruba    1960             NA              65.7      4.82      54208
##  8 Austra…  1960             20.3            70.9      3.45   10292328
##  9 Austria  1960             37.3            68.8      2.7     7065525
## 10 Azerba…  1960             NA              61.3      5.57    3897889
## # … with 10,535 more rows, and 3 more variables: gdp <dbl>,
## #   continent <fct>, region <fct>
names(gapminder)
## [1] "country"          "year"             "infant_mortality"
## [4] "life_expectancy"  "fertility"        "population"      
## [7] "gdp"              "continent"        "region"

A very simple example to start with

gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point()

Fixing overplotting

We call this overlay of points ‘overplotting’.

This can be fixed in several ways:

  • Reducing the transparency of data points
  • Mapping colour to a variable (continuous or categorical)
  • Reduce the data in the plot
  • Mapping a shape to a variable
  • Add noise ("jitter") to points
  • Choosing different geom_
  • Facetting; create panels for ‘categorical’ or so-called ‘factor’ variables in R
  • Summarize the data
  • Displaying a model / relationship that represents the data (and not sho the actual data itself)
  • Or any combination of the above strategies

Basically you map an aesthetic (aes()) or a characteristic to a variable

Let’s go over a few of these overplotting methods

1. Overplotting: Reducing transparency (alpha) of points or lines in the data

gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(alpha = 0.1)

2. Mapping colour to a variable (continuous or categorical)

gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(aes(colour = continent))

or combined with alpha

gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(aes(colour = continent), alpha = 0.1) +
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

Choose a different geom_

# install.packages("ggpointdensity")
library(ggpointdensity)

gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  stat_density2d(aes(fill = continent), geom = 'polygon', alpha = 0.5) +
  guides(fill = guide_legend(override.aes = list(alpha = 1))) 

Do it yourself:

  • Try adjusting some of the arguments in the previous ggplot2 call. For example, adjust the alpha = ... or change the variable in x = ..., y = ... or colour = ...
  • names(gapminder) gives you the variable names that you can change
  • Show and discuss the resulting plot with your neighbour
  • What do you think this part does:

guides(... = guide_legend(override.aes = list(alpha = 1)))

  • Try to find out by disabling with #

3. Reduce the data in the plot

reduce_data_plot <- gapminder %>% 
  filter(continent == "Africa" | continent == "Europe") %>%
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(aes(colour = continent), alpha = 0.2) +
  ## override the alpha setting for the points in the legend:
  guides(colour = guide_legend(override.aes = list(alpha = 1))) 

reduce_data_plot

Discuss with you neighbour:

  • What does the the aes() part of the geom_point() do?

4. Mapping a shape to a variable

## or e.g. show only two years and map a shape to continent
shape_plot <- gapminder %>% 
  dplyr::filter(continent == "Africa" | continent == "Europe",
         year == "1960" | year == "2010") %>%
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(aes(colour = as_factor(as.character(year)), 
                 shape = continent), 
             alpha = 0.7)

shape_plot

5. Facetting

Create panels for ‘categorical’ or so-called ‘factor’ variables in R

facets_plot <- gapminder %>% 
  dplyr::filter(continent == "Africa" | continent == "Europe",
         year == "1960" | year == "2010") %>%
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(aes(colour = continent), alpha = 0.5) +
  facet_wrap(~ year)

facets_plot

6. Summarize the data

library(ggrepel)

years <- c("1960", "1970", "1980", "1990", "2000", "2010")

summarize_plot <- gapminder %>% 
  dplyr::filter(year %in% years) %>%
  group_by(continent, year) %>%
  summarise(mean_life_expectancy = mean(life_expectancy),
            mean_fertility = mean(fertility)) %>%
  ggplot(aes(x = mean_fertility,
             y = mean_life_expectancy)) +
  geom_point(aes(colour = continent), alpha = 0.7) 

summarize_plot

Adding labels to points with {ggrepel}

library(ggrepel)

years <- c("1960", "1970", "1980", "1990", "2000", "2010")

labels_plot <- gapminder %>% 
  dplyr::filter(year %in% years) %>%
  group_by(continent, year) %>%
  summarise(mean_life_expectancy = mean(life_expectancy),
            mean_fertility = mean(fertility)) %>%
  ggplot(aes(x = mean_fertility,
             y = mean_life_expectancy)) +
  geom_point(aes(colour = continent), alpha = 0.7) +
  geom_label_repel(aes(label=year), size = 2.5, box.padding = .5)
  

labels_plot

7. Displaying a model

## Model
lm <- gapminder %>% lm(formula = life_expectancy ~ fertility)

correlation <- cor.test(x = gapminder$fertility, y = gapminder$life_expectancy, method = "pearson")

# save predictions of the model in the new data frame 
# together with variable you want to plot against
predicted_df <- data.frame(gapminder_pred = predict(lm, gapminder), 
                           fertility = gapminder$fertility)

Add model to plot

model_plot <- gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(alpha = 0.03) +
  geom_line(data = predicted_df, aes(x = fertility, 
                                     y = gapminder_pred),
            colour = "darkred", size = 1)
model_plot

Plotting statistics to the graph with the {ggpubr} package

Using a smoother geom_smooth to display potential relationships

gapminder %>% 
  ggplot(aes(x = fertility,
             y = life_expectancy)) +
  geom_point(alpha = 0.02) +
  geom_smooth(method = "lm") +
  stat_cor(method = "pearson", label.x = 2, label.y = 30) +
  theme_bw()

Time series

So far we have been mapping colours and shapes to categorical variables. You can also map to continuous variables though.

continuous <- gapminder %>%
  dplyr::filter(country == "Netherlands" | 
                country == "China" |
                country == "India") %>%
  dplyr::filter(year %in% years) %>%
  ggplot(aes(x = year,
         y = life_expectancy)) +
  geom_point(aes(size = population, colour = country)) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  geom_line(aes(group = country)) +
  theme_bw()

continuous

Summarize per continent and sum population

population_plot <- gapminder %>% 
  dplyr::group_by(continent, year) %>%
  dplyr::filter(year %in% years) %>%
  summarise(sum_population = sum(population)) %>% 
  ggplot(aes(x = year, 
             y = sum_population)) +
    geom_point(aes(colour = continent)) +
    geom_line(aes(group = continent,
                  colour = continent))
population_plot

Filtering for one country

dslabs::gapminder %>%
  dplyr::filter(country == "Rwanda") %>%
  ggplot(aes(x = year, y = life_expectancy)) +
  geom_point() +
  geom_line() +
  ggtitle("Life expectancy for Rwanda over the years") 

Ranking data

ranking_plot <- gapminder %>%
  dplyr::filter(continent == "Europe",
                year == 2010) %>%
  ggplot(aes(x = reorder(as_factor(country), population),
             y = log10(population))) +
  geom_point() +
  ylab("log10(Population)") +
  xlab("Country") + 
  coord_flip() +
  geom_point(data = filter(gapminder %>%
  dplyr::filter(continent == "Europe",
                year == 2010), population >= 1e7), colour = "red")

ranking_plot

Grouping

Plot individual datapoints over a group aesthetic. Note the outliers!!

gapminder %>% 
  dplyr::filter(continent == "Americas" | continent == "Africa") %>%
  ggplot(aes(x = year,
             y = life_expectancy)) +
  geom_point(aes(colour = country)) +
  theme(legend.position="none") +
  facet_wrap( ~ continent) +
  theme(legend.position="none") 

Summarizing time series data

gapminder$continent %>% as_factor() %>% levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
gapminder %>% 
  dplyr::filter(continent == "Americas" | continent == "Africa") %>%
  group_by(continent, year) %>%
  summarise(mean_life_expectancy = mean(life_expectancy)) %>%
  ggplot(aes(x = year,
             y = mean_life_expectancy)) +
  geom_line(aes(group = continent,
                colour = continent)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Statistical inference

df <- gapminder %>% 
  dplyr::filter(continent == "Americas" | continent == "Oceania") %>%
  group_by(continent, year)

model <- aov(data = df, life_expectancy ~ continent * year)
anova(model)
## Analysis of Variance Table
## 
## Response: life_expectancy
##                  Df Sum Sq Mean Sq  F value Pr(>F)    
## continent         1   8982    8982  269.104 <2e-16 ***
## year              1  58606   58606 1755.931 <2e-16 ***
## continent:year    1      9       9    0.278 0.5981    
## Residuals      2732  91183      33                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Some remarks on the above Two-way ANOVA:

  • Repeated measures / multilevel models might be more appropriate here (paired / nested designs)
  • We did not perform any check on assumptions
  • We performed our analysis on only part of the data

One more option: categorical values and “jitter”

Sometimes you have overlapping plots and adding transparency with alpha() or mapping colour to underlying categorical values is not working because there are simple to many points overlapping

Let’s look at an example

gapminder %>% 
  dplyr::filter(continent == "Americas" |
                continent == "Africa") %>%
  group_by(continent) %>%
  dplyr::filter(year %in% years) %>%
  ggplot(aes(x = year,
             y = infant_mortality)) +
  geom_point(aes(colour = country)) +
  theme(legend.position="none")

In such cases it can be helpfull to add some noise to the points (position = "jitter") to reduce overlapping. This can be a powerfull approach, especially when combined with setting alpha()

gapminder %>% 
  dplyr::filter(continent == "Americas" |
                continent == "Africa") %>%
  dplyr::filter(year %in% years) %>%
    group_by(continent) %>%
  ggplot(aes(x = year,
             y = infant_mortality)) +
  geom_point(aes(colour = continent), position = "jitter") 

Bar chart

It would be nice to know what the mean child mortality is for both continents

gapminder %>% 
  dplyr::filter(continent == "Americas" |
                continent == "Africa") %>%
  dplyr::filter(year %in% years) %>%
  group_by(continent, year) %>%
  summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE)) %>% 
  ggplot(aes(x = year,
             y = mean_infant_mortality)) +
  geom_col(aes(fill = continent), position = "dodge") 

Adding summary data to an existing plot

Now that we have the mean infant mortality for each year for the two continents, let’s add that data to the previous dot plot where we used jitter

mean_inf_mort <- gapminder %>% 
  dplyr::filter(continent == "Americas" |
                continent == "Africa") %>%
  dplyr::filter(year %in% years) %>%
  group_by(continent, year) %>%
  summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE))

gapminder %>% 
  dplyr::filter(continent == "Americas" |
                continent == "Africa") %>%
  dplyr::filter(year %in% years) %>%
    group_by(continent) %>%
  ggplot(aes(x = year,
             y = infant_mortality)) +
  geom_point(aes(colour = continent), position = "jitter") +

## summary data added to previous 
  geom_line(data = mean_inf_mort, aes(x = year, 
                                      y = mean_infant_mortality, 
                                      colour = continent),  size = 2)

Filter data from a graph

In the figure above we can observe a number of countries in ‘Americas’ continent that have a child mortality that are above the average (over the years) of ‘Africa’. Which countries are this?

library(ggiraph)

gapminder$country <- 
  str_replace_all(string = gapminder$country, 
                pattern = "'", 
                replacement = "_")


interactive_inf_mort <- gapminder %>% 
  dplyr::filter(continent == "Americas" |
                continent == "Africa") %>%
  dplyr::filter(year %in% years) %>%
    group_by(region, country) %>%
  ggplot(aes(x = year,
             y = infant_mortality)) +
  
  geom_point_interactive(aes(tooltip = country, colour = region), position = "jitter") +
  
#  geom_point(aes(colour = continent), position = "jitter") +

## summary data added to previous 
 geom_line(data = mean_inf_mort, aes(x = year, 
                                      y = mean_infant_mortality, 
                                      colour = continent, group = continent),  size = 2
            )

interactive_inf_mort

gapminder$country %>% as_factor() %>% levels()
##   [1] "Albania"                        "Algeria"                       
##   [3] "Angola"                         "Antigua and Barbuda"           
##   [5] "Argentina"                      "Armenia"                       
##   [7] "Aruba"                          "Australia"                     
##   [9] "Austria"                        "Azerbaijan"                    
##  [11] "Bahamas"                        "Bahrain"                       
##  [13] "Bangladesh"                     "Barbados"                      
##  [15] "Belarus"                        "Belgium"                       
##  [17] "Belize"                         "Benin"                         
##  [19] "Bhutan"                         "Bolivia"                       
##  [21] "Bosnia and Herzegovina"         "Botswana"                      
##  [23] "Brazil"                         "Brunei"                        
##  [25] "Bulgaria"                       "Burkina Faso"                  
##  [27] "Burundi"                        "Cambodia"                      
##  [29] "Cameroon"                       "Canada"                        
##  [31] "Cape Verde"                     "Central African Republic"      
##  [33] "Chad"                           "Chile"                         
##  [35] "China"                          "Colombia"                      
##  [37] "Comoros"                        "Congo, Dem. Rep."              
##  [39] "Congo, Rep."                    "Costa Rica"                    
##  [41] "Cote d_Ivoire"                  "Croatia"                       
##  [43] "Cuba"                           "Cyprus"                        
##  [45] "Czech Republic"                 "Denmark"                       
##  [47] "Djibouti"                       "Dominican Republic"            
##  [49] "Ecuador"                        "Egypt"                         
##  [51] "El Salvador"                    "Equatorial Guinea"             
##  [53] "Eritrea"                        "Estonia"                       
##  [55] "Ethiopia"                       "Fiji"                          
##  [57] "Finland"                        "France"                        
##  [59] "French Polynesia"               "Gabon"                         
##  [61] "Gambia"                         "Georgia"                       
##  [63] "Germany"                        "Ghana"                         
##  [65] "Greece"                         "Greenland"                     
##  [67] "Grenada"                        "Guatemala"                     
##  [69] "Guinea"                         "Guinea-Bissau"                 
##  [71] "Guyana"                         "Haiti"                         
##  [73] "Honduras"                       "Hong Kong, China"              
##  [75] "Hungary"                        "Iceland"                       
##  [77] "India"                          "Indonesia"                     
##  [79] "Iran"                           "Iraq"                          
##  [81] "Ireland"                        "Israel"                        
##  [83] "Italy"                          "Jamaica"                       
##  [85] "Japan"                          "Jordan"                        
##  [87] "Kazakhstan"                     "Kenya"                         
##  [89] "Kiribati"                       "South Korea"                   
##  [91] "Kuwait"                         "Kyrgyz Republic"               
##  [93] "Lao"                            "Latvia"                        
##  [95] "Lebanon"                        "Lesotho"                       
##  [97] "Liberia"                        "Libya"                         
##  [99] "Lithuania"                      "Luxembourg"                    
## [101] "Macao, China"                   "Macedonia, FYR"                
## [103] "Madagascar"                     "Malawi"                        
## [105] "Malaysia"                       "Maldives"                      
## [107] "Mali"                           "Malta"                         
## [109] "Mauritania"                     "Mauritius"                     
## [111] "Mexico"                         "Micronesia, Fed. Sts."         
## [113] "Moldova"                        "Mongolia"                      
## [115] "Montenegro"                     "Morocco"                       
## [117] "Mozambique"                     "Namibia"                       
## [119] "Nepal"                          "Netherlands"                   
## [121] "New Caledonia"                  "New Zealand"                   
## [123] "Nicaragua"                      "Niger"                         
## [125] "Nigeria"                        "Norway"                        
## [127] "Oman"                           "Pakistan"                      
## [129] "Panama"                         "Papua New Guinea"              
## [131] "Paraguay"                       "Peru"                          
## [133] "Philippines"                    "Poland"                        
## [135] "Portugal"                       "Puerto Rico"                   
## [137] "Qatar"                          "Romania"                       
## [139] "Russia"                         "Rwanda"                        
## [141] "St. Lucia"                      "St. Vincent and the Grenadines"
## [143] "Samoa"                          "Saudi Arabia"                  
## [145] "Senegal"                        "Serbia"                        
## [147] "Seychelles"                     "Sierra Leone"                  
## [149] "Singapore"                      "Slovak Republic"               
## [151] "Slovenia"                       "Solomon Islands"               
## [153] "South Africa"                   "Spain"                         
## [155] "Sri Lanka"                      "Sudan"                         
## [157] "Suriname"                       "Swaziland"                     
## [159] "Sweden"                         "Switzerland"                   
## [161] "Syria"                          "Tajikistan"                    
## [163] "Tanzania"                       "Thailand"                      
## [165] "Timor-Leste"                    "Togo"                          
## [167] "Tonga"                          "Trinidad and Tobago"           
## [169] "Tunisia"                        "Turkey"                        
## [171] "Turkmenistan"                   "Uganda"                        
## [173] "Ukraine"                        "United Arab Emirates"          
## [175] "United Kingdom"                 "United States"                 
## [177] "Uruguay"                        "Uzbekistan"                    
## [179] "Vanuatu"                        "Venezuela"                     
## [181] "West Bank and Gaza"             "Vietnam"                       
## [183] "Yemen"                          "Zambia"                        
## [185] "Zimbabwe"
ggiraph(ggobj = interactive_inf_mort)

A more complicated example (for showing the capabilities of ggplot2)

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% west ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>%
  mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))

filter(gapminder, year%in%c(1962, 2013) & !is.na(group) &
         !is.na(fertility) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = population/10^6) %>%
  ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
  geom_point(alpha = 0.8) +
  guides(size=FALSE) +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(30, 85)) +
  xlab("Fertility rate (births per woman)") +
  ylab("Life Expectancy") +
  geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
  facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

Data Distributions & Outliers

Detecting outliers

For this part we use a different and more simple dataset This dataset contains 1192 observations on self-reported:

  • height (inch)
  • earn ($)
  • sex (gender)
  • ed (currently unannotated)
  • age (years)
  • race
heights_data <- read_csv(file = here::here("data",
                                          "heights_outliers.csv"))

heights_data
## # A tibble: 1,192 x 6
##     earn height sex       ed   age race    
##    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
##  1 50000   74.4 male      16    45 white   
##  2 60000   65.5 female    16    58 white   
##  3 30000   63.6 female    16    29 white   
##  4 50000   63.1 female    16    91 other   
##  5 51000   63.4 female    17    39 white   
##  6  9000   64.4 female    15    26 white   
##  7 29000   61.7 female    12    49 white   
##  8 32000   72.7 male      17    46 white   
##  9  2000   72.0 male      15    21 hispanic
## 10 27000   72.2 male      12    26 white   
## # … with 1,182 more rows

Data characteristics

We will focus on the variable height here

summary_heights_data <- heights_data %>%
  group_by(sex, age) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            min_height = min(height),
            max_height = max(height)) %>%
  arrange(desc(mean_height))

summary_heights_data[c(1:4),]
## # A tibble: 4 x 5
## # Groups:   sex [2]
##   sex      age mean_height min_height max_height
##   <chr>  <dbl>       <dbl>      <dbl>      <dbl>
## 1 female    55       141.        61.9      664. 
## 2 male      39       134.        66.6      572. 
## 3 male      55        73.2       71.7       74.8
## 4 male      91        73.1       73.1       73.1

From the above summary we can conclude that there are two outliers (presumably entry errors).

Calculate the height in meters for each outlier in the Console 1 inch = 0,0254 meters

Please discuss the solution with your neighbour

Checking the frequency distribution

heights_data %>%
  ggplot(aes(x = height)) +
  geom_histogram(aes(stat = "identity"), bins = 200)

This distribution looks odd. When you see a large x-axis with no data plotted on it, it usually means there is an outlier. If you look carefully, you will spot two outliers around 600

Boxplots to detect outliers

heights_data %>%
  ggplot(aes(y = height)) +
  geom_boxplot()

So apparantly there is one data point that is way off from the rest of the distribution. Let’s remove this point, using filter() from the {dplyr} package like we did before on the gapminder dataset.

heights_data %>%
  dplyr::filter(height < 100) %>%
  ggplot(aes(y = height)) +
  geom_boxplot()

## by sex

heights_data %>%
  dplyr::filter(height < 100) %>%
  ggplot(aes(y = height, x = sex)) +
  geom_boxplot()

New frequency distribution

Now let’s plot a new distribution plot, this time we plot density, leaving the outlier out

heights_data %>%
  dplyr::filter(height < 100) %>%
  ggplot(aes(height)) +
  geom_freqpoly(aes(y = ..density..))

## by sex
heights_data %>%
  dplyr::filter(height < 100) %>%
  ggplot(aes(height)) +
  geom_freqpoly(aes(y = ..density.., colour = sex))

A formal check to detect an outlier - handle with care

The best way to spot and take care of an outlier (you never delete an outlier unsubstantiated!!) is to create multiple visualizations as decribed above. However, there is a formal, simple statistical apporach to assess whether an observation is an outlier. The apporach below can only be used to detect a single outlier in a numeric vector. For more complex (multivariate) situations see e.g.: http://r-statistics.co/Outlier-Treatment-With-R.html

The Dixon’s Q Test for detecting outliers

The {outliers} package contains a number of tools. Here we will demonstrate the outlier() and dixon.test() functions. For more information on the DQT, see: https://www.statisticshowto.datasciencecentral.com/dixons-q-test/

outlier() function

This function finds the value with largest difference between it and sample mean, which can be an outlier.

We will walk trough a simple exercise that first shows the result of the outlier() function and than we calculate it by hand

First we create dummy data for a triplicate series of measurements. We assume 5 measurement values, replicated three times, containing one outlier. We also assume is was a concentration depended series in which the index indicates the used concentration.

set.seed(123)
## create data
random_numbers <- rnorm(n = 15, mean = 2.5, sd = 0.9) 
## inject outlier
random_numbers[4] <- 9.54
## define triplicates
triplicates <- rep(1:3, times = 5)
index <- rep(1:5, times = 3)

## put in dataframe
measured_data <- tibble(measured = random_numbers, replicate = triplicates,
                        index = index)
measured_data
## # A tibble: 15 x 3
##    measured replicate index
##       <dbl>     <int> <int>
##  1     2.00         1     1
##  2     2.29         2     2
##  3     3.90         3     3
##  4     9.54         1     4
##  5     2.62         2     5
##  6     4.04         3     1
##  7     2.91         1     2
##  8     1.36         2     3
##  9     1.88         3     4
## 10     2.10         1     5
## 11     3.60         2     1
## 12     2.82         3     2
## 13     2.86         1     3
## 14     2.60         2     4
## 15     2.00         3     5
## sort data
measured_data %>% 
  arrange(desc(measured))
## # A tibble: 15 x 3
##    measured replicate index
##       <dbl>     <int> <int>
##  1     9.54         1     4
##  2     4.04         3     1
##  3     3.90         3     3
##  4     3.60         2     1
##  5     2.91         1     2
##  6     2.86         1     3
##  7     2.82         3     2
##  8     2.62         2     5
##  9     2.60         2     4
## 10     2.29         2     2
## 11     2.10         1     5
## 12     2.00         3     5
## 13     2.00         1     1
## 14     1.88         3     4
## 15     1.36         2     3

Plotting the data reveals the outlier.

## plot
measured_data %>% 
#  enframe() %>%
  ggplot(aes(x = index, y = measured)) +
  geom_point() +
  geom_point(data = dplyr::filter(.data = measured_data, measured > 5), 
             colour = "red", size = 6, shape = 4)

Identify with outliers::outlier()

outliers::outlier(measured_data$measured)
## [1] 9.54

Identify with outliers::dixon.test()

## note that these functions from the {outliers} package take a vector (use $)
outliers::dixon.test(measured_data$measured)
## 
##  Dixon test for outliers
## 
## data:  measured_data$measured
## Q = 0.7472, p-value < 2.2e-16
## alternative hypothesis: highest value 9.54 is an outlier

The test is significant against the NULL hypothesis that the highest value is not an outlier. We can conclude that this NULL hypothesis does not hold on the basis of the significant p-value < 0.05

Manual calculations:

outlier()

mean_vctr <- mean(measured_data$measured)

distance <- measured_data$measured - mean_vctr

ind <- distance == max(distance)

measured_data[ind,]
## # A tibble: 1 x 3
##   measured replicate index
##      <dbl>     <int> <int>
## 1     9.54         1     4

dixon.test()

The Dixon’s Q Test can be calculated manually following the following procedure:

How to Run Dixon’s Q Test (R10).

Note: make sure your data set is normally distributed before running the test; for example, run a Shapiro-Wilk test. Running it on different distributions will lead to erroneous results. An extreme value may throw off any test for normality, so try running that test without the suspect data item. If your data set still doesn’t meet the assumption of normality after running a test for it, then you should not run Dixon’s Q Test.

Note that for very small sample sizes the assumtion for Normal distribution usually does not hold. We will see more formal way of establishing Normal distribution below.

Caution: the test should not be used more than once for the same set of data.

To perform the calculations:

Step 1: Sort your data into ascending order (smallest to largest).

Step 2 :Find the Q statistic using the following formula:

\(Qexp = \frac{x2 - x1}{xn - x1}\)

Where

  • \(x1\) is the smallest (suspect) value
  • \(x2\) is the second smallest value
  • \(xn\) is the largest value

Dixon’s Q Test: Definition, Step by Step Examples + Q Critical Values Tables

Find Outliers > Dixon’s Q Test What is Dixon’s Q Test?

Dixon’s Q test, or just the “Q Test” is a way to find outliers in very small, normally distributed, data sets. Small data sets are usually defined as somewhere between 3 and 7 items. It’s commonly used in chemistry, where data sets sometimes include one suspect observation that’s much lower or much higher than the other values. Keeping an outlier in data affects calculations like the mean and standard deviation, so true outliers should be removed.

Dixon came up with many different equations to find true outliers. The most commonly used one is called the R10 or simply the “Q” version, which is used to test if one single value is an outlier in a sample size of between 3 and 7. Dean and Dixon did suggest various other formulas in a later paper, but these are not commonly used. For a full list of alternate formulas for different sample sizes (up to about 30), go to: Dixon’s Test, Alternate Formulas and Tables. How to Run Dixon’s Q Test (R10).

Note: make sure your data set is normally distributed before running the test; for example, run a Shapiro-Wilk test. Running it on different distributions will lead to erroneous results. An extreme value may throw off any test for normality, so try running that test without the suspect data item. If your data set still doesn’t meet the assumption of normality after running a test for it, then you should not run Dixon’s Q Test,

Caution: the test should not be used more than once for the same set of data.

Step 1: Sort your data into ascending order (smallest to largest). 167, 177, 180, 181, 185, 188, 189.

Step 2 :Find the Q statistic using the following formula: dixon’s q test statistic

Where:

x1 is the smallest (suspect) value,
x2 is the second smallest value,
and xn is the largest value.

Step 3: Find the Q critical value in the Q table (see below). For a sample size of 3 and an alpha level of 5%, the critical value is 0.970.

Step 4: Compare the Q statistic from Step 2 with the Q critical value in Step 3. If the Q statistic is greater than the Q critical value, the point is an outlier.

Let’s try it on our measured_data

measured_data_ranked <- measured_data %>%
  arrange(measured) 
measured_data_ranked$measured
##  [1] 1.361445 1.881832 1.995572 1.999743 2.098904 2.292840 2.599614
##  [8] 2.616359 2.823832 2.860694 2.914825 3.601674 3.902837 4.043558
## [15] 9.540000
x1 <- measured_data_ranked$measured[1]
x2 <- measured_data_ranked$measured[2]
xn <- max(measured_data_ranked$measured)

Qexp = (x2-x1) / (xn-x1)

##
0.5/8
## [1] 0.0625

Checking normality with a qqplot

a qqplot provides a visual aid to assess whether a distribution is approaching normality

## 
source(file = here::here("code", "ggqq.R"))
height_data_outlier_removed <- heights_data %>%
  dplyr::filter(height < 100)
  gg_qq(height_data_outlier_removed$height) 

##       25%       75% 
## 66.926998  4.328462
## formal statistical proof
shapiro.test(height_data_outlier_removed$height)
## 
##  Shapiro-Wilk normality test
## 
## data:  height_data_outlier_removed$height
## W = 0.98485, p-value = 8.491e-10

all data -> reject hypothesis that the sample has a normal distribution

Test individual distributions

males <- height_data_outlier_removed %>%
  dplyr::filter(sex == "male")

females <- height_data_outlier_removed %>%
  dplyr::filter(sex == "female")

shapiro.test(males$height)
## 
##  Shapiro-Wilk normality test
## 
## data:  males$height
## W = 0.99053, p-value = 0.002532
shapiro.test(females$height)
## 
##  Shapiro-Wilk normality test
## 
## data:  females$height
## W = 0.99277, p-value = 0.002105
## add shapiro for each sex

## we can do the same for age
shapiro.test(males$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  males$age
## W = 0.93358, p-value = 3.506e-14
shapiro.test(females$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  females$age
## W = 0.93978, p-value = 4.862e-16